Using the Literature to Identify Confounding Variables for Drug Safety and Drug
Lead Investigator: Scott Malec
Institution: University of Pittsburgh School of Medicine
E-Mail: sam413@pitt.edu
Proposal ID: 1269

Proposal Description:

Health data, such as those found in electronic health records (EHRs), contain a wealth of information for research. They allow the identification of links between health events, such as drug exposures and side-effects. Some of these links reflect stable dependencies that can be regarded as causal. Causal insight makes it possible to reverse-engineer disease. Unless confounding is addressed, however, it is difficult to distinguish causal links from merely correlational ones. Our approach is to identify confounders explicitly.

Graphical causal models (GCMs) summarize causal links between variables, and GCM algorithms can discover such links from data and prior knowledge. Automated selection of variables would allow GCMs to scale and yield more insight from data. Literature-based discovery (LBD) methods were developed to identify links between concepts in the literature. Advanced LBD methods permit the search for concepts linked to each other through specific verbs, e.g., "causes" or "treats".

Our hypothesis is that structured knowledge extracted from the literature can be exploited to inform GCMs. In prior work, we found that LBD + GCM was better at identifying side-effects in EHR data than traditional methods. We hypothesize that, compared with methods that use data alone, our method will improve the ability to detect causal relationships in EHR data.

The first aim is to determine the extent to which LBD-informed GCMs improve the identification of causal links for drug safety. We will build LBD-informed GCMs using publicly available reference datasets for drug safety. These reference datasets contain drug/side-effect pairs for performance benchmarking.

(A) Test the ability of GCM algorithms to identify known causal links using data alone.
We will systematically evaluate GCM algorithms on their ability to rediscover the causal links in a reference standard. The results will guide our studies of how GCMs can be tuned.
(B) Determine the effect of adding different subsets of LBD-derived information to GCMs at identifying